[Feature] Add zero bubble for spec v2 #21895
litmei wants to merge 48 commits into sgl-project:main
Motivation
In the current SGLang framework, the EAGLE3 Spec V2 implementation suffers from a CPU-side scheduling bottleneck. Specifically, the CPU dispatch between consecutive decode steps is bound by the draft model's overhead, creating execution bubbles that the overlap scheduler cannot effectively hide. The goal of this PR is to refactor the scheduling logic to minimize or completely eliminate these CPU-originated bubbles.
Modifications
- **Asynchronous Data Transfer**: Based on profiling results, all identified `to("cpu")` operations have been made asynchronous via `.pin_memory().to("cpu", non_blocking=True)` to reduce synchronization stalls. See also PR #21360: "Use pin_memory in forward_batch.init_new to reduce decoding latency".
- **Scheduling Refactor & Draft Pre-execution**:
  - Added `prepare_for_verify`, which handles input construction for verification or output restoration from the previous round.
  - Stopped updating `ForwardBatch.seq_lens_cpu` during the drafting phase. This is safe for models such as DeepSeek-V3.2, which do not rely on `seq_lens_cpu` during the decode stage. For models like Qwen3 that do require these lengths, this change may affect the accepted length (causing it to fluctuate higher or lower).
  - Rolled back PR #21507, since the native implementation leads to significant degradation in MTP scenarios.
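For reference, the asynchronous device-to-host transfer pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the helper name `async_to_cpu` is hypothetical, and the standard idiom of copying into a pre-pinned host buffer is used in place of the exact call chain in `forward_batch_info.py`:

```python
import torch

def async_to_cpu(t: torch.Tensor) -> torch.Tensor:
    """Copy a tensor to host memory without forcing a hard device sync.

    Hypothetical helper for illustration only. Page-locked (pinned) host
    memory lets the D2H DMA copy run asynchronously on the current CUDA
    stream; the caller must synchronize before reading the result.
    """
    # Pinning host memory requires a CUDA context, so fall back to a
    # plain pageable buffer on CPU-only machines.
    pin = torch.cuda.is_available()
    out = torch.empty(t.shape, dtype=t.dtype, pin_memory=pin)
    # With a CUDA source and a pinned destination, this copy is async.
    out.copy_(t, non_blocking=True)
    return out
```

The key point is that the scheduler thread is not blocked waiting on the copy; the synchronization cost is paid only at the point where the CPU actually consumes the values.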
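The zero-bubble scheduling idea behind the refactor can be sketched with a toy pipeline: while the device executes verification for step *i*, the CPU prepares the inputs for step *i+1*, so draft-side CPU overhead is hidden behind device work. All names below (`overlapped_decode` and the `draft`/`verify`/`prepare_for_verify` callables) are illustrative stand-ins, not SGLang's actual API:

```python
import concurrent.futures

def overlapped_decode(num_steps, draft, verify, prepare_for_verify):
    """Toy sketch of overlap scheduling (not SGLang's implementation).

    A single worker thread stands in for the device stream: verification
    for the previous step runs concurrently with CPU-side input
    preparation for the next step, eliminating the dispatch bubble.
    """
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for step in range(num_steps):
            inputs = prepare_for_verify(step)       # CPU work for this round
            if pending is not None:
                results.append(pending.result())    # collect previous verify
            # Launch draft + verify for this round; the loop immediately
            # proceeds to prepare the next round on the CPU.
            pending = pool.submit(verify, draft(inputs))
        results.append(pending.result())            # drain the last step
    return results
```

In the real scheduler the "worker" is a CUDA stream rather than a thread, but the structure is the same: the CPU never idles waiting for a verify to finish before it starts constructing the next batch.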
Accuracy Tests
Theoretically, for DSA models this feature has no impact on accuracy or accepted length, as seen with DeepSeek-V3.2:
We need to investigate why DeepSeek-V3.2 didn't yield any gains here.
Before:
After:
Speed Tests and Profiling
TODO
H20 dsv3.2 (layer pruning) profile
Before:
After:
Ascend A3 dsv3.2-w8a8 profile
Before:
After:
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci